Passengers wait to board an arriving NJ Transit Train
What if we could tell you that your train was going to be 6 minutes late for your Tuesday morning commute, ten days before it was time to leave rather than ten minutes? Would that knowledge be valuable to you?
Although train delays may seem like minor annoyances in the daily life of a commuter, there are real economic costs to train tardiness. The city of New York Office of the Controller found that major delays (5-minute delays) annually cost the city $170.2 million dollars (Stringer, 2017). While some systems are truer to their timetables, systems like Amtrak are notoriously late (Lazo, 2019). In the case of Amtrak, a primary driver of lateness is congestion as it does not own most of the right of way it operates on. In a 2019 report detailing its on-time-performance (OTP) woes, Amtrak estimated approximately $171 million in losses during 2018 due to 27% of their trains being late (Amtrak, 2019).
OTP issues are not exclusive to Amtrak in the Northeast. Regional operators such as the NJ Transit have a history of delays (McGeehan, 2018). For the 90,000 daily riders of NJ transit, these delays mean lost time, revenue, and sanity. In this study, we build and test a linear regression to predict delay length for trains based on a variety of predictors. The hope is that this model can be utilized by agencies, researchers, and the average commuter.
For agencies and researchers, the visualizations and algorithm can serve to identify bottlenecks in the system. Whether that be the stations that are coming from or arriving to or if there is a particular day of the week that is most late. This analysis will lead to the identification of lateness so that it can be addressed. For the average commuter, this analysis can provide a way to determine average lateness and given a set of variables; how late will the train be?
import <- paste0("Amtrak/",list.files("Amtrak"))
df <- data.frame()
############ TO RUN FULL MODEL, UNCOMMENT THIS CODE ############
# for (i in 2:nrow(as_tibble(import))) {
# df <- rbind(df,read.csv(import[i]))
# }
############ TO RUN FULL MODEL, UNCOMMENT THIS CODE ############
############ TO RUN SAFE MODEL, UNCOMMENT THIS CODE ############
for (i in 2:nrow(as_tibble(import))) {
if(grepl("2020",import[i])){
df <- rbind(df,read.csv(import[i]))
} else {
# print(paste(import[i], "not added"))
}
}
############ TO RUN SAFE MODEL, UNCOMMENT THIS CODE ############
# Stations There's no complete set of NJ transit lines anymore ---------------------------
NJTransitLines <- bind_rows(st_read("https://opendata.arcgis.com/datasets/fc24118ac4a0445da5e09f7692075757_13.geojson"),
st_read("https://opendata.arcgis.com/datasets/fc24118ac4a0445da5e09f7692075757_9.geojson"),
st_read("https://opendata.arcgis.com/datasets/fc24118ac4a0445da5e09f7692075757_3.geojson"),
st_read("https://opendata.arcgis.com/datasets/fc24118ac4a0445da5e09f7692075757_4.geojson"),
st_read("https://opendata.arcgis.com/datasets/fc24118ac4a0445da5e09f7692075757_1.geojson"),
st_read("https://opendata.arcgis.com/datasets/fc24118ac4a0445da5e09f7692075757_5.geojson"),
st_read("https://opendata.arcgis.com/datasets/fc24118ac4a0445da5e09f7692075757_0.geojson"),
st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Linea/FeatureServer/2/query?outFields=*&where=1%3D1&f=geojson"),
st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Linea/FeatureServer/6/query?outFields=*&where=1%3D1&f=geojson"),
st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Linea/FeatureServer/7/query?outFields=*&where=1%3D1&f=geojson"),
st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Linea/FeatureServer/8/query?outFields=*&where=1%3D1&f=geojson"),
st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Linea/FeatureServer/10/query?outFields=*&where=1%3D1&f=geojson"),
st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Linea/FeatureServer/11/query?outFields=*&where=1%3D1&f=geojson"),
st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Linea/FeatureServer/12/query?outFields=*&where=1%3D1&f=geojson")) %>%
select(!c("OBJECTID","LINE_CODE"))
NJtransitstations <- st_read("https://opendata.arcgis.com/datasets/4809dada94c542e0beff00600ee930f6_0.geojson")
NJshp <- tigris::states() %>%
filter(NAME == "New Jersey")
df <- df %>%
filter(type == "NJ Transit") %>%
mutate(date = ymd(date),
scheduled_time = ymd_hms(scheduled_time),
actual_time = ymd_hms(actual_time),
week = week(date),
DotW = wday(date),
Start_time = as_hms(scheduled_time),
Actual_time = as_hms(actual_time),
AM_PM = ifelse(am(Actual_time) == TRUE, "AM","PM"),
Year = year(date),
# Code below used to get the lag of the station previous in station sequence
# Get train id of previous row
train_lag = lag(train_id),
# Find the delay at that station
time_lag_dumb = lag(delay_minutes),
# Check if current stop is on the same train as the previous stop
train_end = train_lag == train_id,
# If we're still on the same train, remove time_lag
station_lag = ifelse(train_end == TRUE, time_lag_dumb,NA),
accumulated_delay = delay_minutes + station_lag,
quintile_delay = q5(delay_minutes),
ID = row_number()) %>%
mutate(scheduled_TOD = case_when(hour(scheduled_time) < 7 | hour(scheduled_time) > 18 ~ "Overnight",
hour(scheduled_time) >= 7 & hour(scheduled_time) < 10 ~ "AM Rush",
hour(scheduled_time) >= 10 & hour(scheduled_time) < 15 ~ "Mid-Day",
hour(scheduled_time) >= 15 & hour(scheduled_time) <= 18 ~ "PM Rush")) %>%
select(!c("time_lag_dumb","train_end","train_lag")) %>%
rename(scheduled_datetime = scheduled_time,
actual_datetime = actual_time)
# %>%
# mutate(station_link = paste(from, to, sep = "-"))
Analysis_Dates <- seq(min(as.Date(df$date)), max(as.Date(df$date)), by="days")
Analysis_Dates <- as_tibble(Analysis_Dates) %>%
mutate(day = row_number()) %>%
rename(date = value)
df <- left_join(df,Analysis_Dates, by = "date")
# Weather -------------------------------------
#df %>%
# filter(line %in% c("Amtrak","AMTRAK","AMTRAK REGIONAL")) %>%
# View()
#f <- df %>%
# mutate(merge = case_when(from == "New York Penn Station" & to == "New York Penn Station" & line == "Amtrak" ~ "Winner",
# from == "Philadelphia" & to == "New York Penn Station" & line == "Amtrak" ~ "Something",
# from == "New York Penn Station" & to == "Philadelphia" & line == "Amtrak" ~ "Apple"))
weather.Panel <-
riem_measures(station = "EWR", date_start = (as.Date(min(df$date))-5), date_end = (as.Date(max(df$date))+5)) %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
df <- left_join(df %>% mutate(interval60 = floor_date(actual_datetime, unit = "hour")), weather.Panel, by = "interval60")
#Attaching Geography -----------------------------------
df <- df %>%
mutate(clean = ifelse(is.na(Start_time) == TRUE, "remove","keep")) %>%
filter(clean == "keep") %>%
select(!clean) %>%
mutate(LINE_NAME = case_when(line == "Atl. City Line" ~ "Atlantic City Line",
line == "Bergen Co. Line " ~ "Bergen County Line",
# Be careful about the space at the end the Bergen Co. Line - was causing missing vals
line == "Gladstone Branch" ~ "Gladstone Branch",
line == "Main Line" ~ "Main Line",
line == "Montclair-Boonton" ~ "Montclair-Boonton Line",
line == "Morristown Line" ~ "Morristown Line",
line == "No Jersey Coast" ~ "NJ Coast Line",
line == "Northeast Corrdr" ~ "Northeast Corridor",
line == "Pascack Valley" ~ "Pascack Valley Line",
line == "Princeton Shuttle" ~ "Princeton Dinky",
line == "Raritan Valley" ~ "Raritan Valley Line"))
df_shp <- left_join(df, NJTransitLines, by = "LINE_NAME") %>%
st_as_sf()
# Cleaning Geometry ----------------------------------
NJtransitstations[149,2] <- "Middletown NY"
NJtransitstations$STATION_ID <- trimws(NJtransitstations$STATION_ID)
Station_clean <- df_shp %>%
st_drop_geometry() %>%
select(from) %>%
arrange(from) %>%
unique() %>%
mutate(Spelled_correctly = ifelse(from %in% NJtransitstations$STATION_ID, TRUE, FALSE)) %>%
filter(Spelled_correctly == FALSE)
f <- c("Anderson Street-Hackensack",
"Atlantic City",
"Bay Street-Montclair",
"Broadway",
"Essex Street-Hackensack",
"Glen Rock-Boro Hall",
"Glen Rock-Main",
"Hoboken Terminal",
"Middletown",
"Montclair St Univ",
"Mt Arlington",
"Mountain Station",
"Pennsauken Transit Center",
"30th Street Station",
"Princeton Jct.",
"Radburn",
"Ramsey",
"Rte 17 Ramsey",
"Secaucus Junction Upper Level",
"Secaucus Junction Lower Level",
"Secaucus Junction Upper Level",
"Teterboro-Williams Ave",
"Watsessing",
"Wayne Route 23 Transit Center",
"Wood-Ridge")
Station_clean <- bind_cols(Station_clean,f) %>%
rename(clean_name = ...3)
for(i in 1:nrow(Station_clean)) {
df_shp$from <- str_replace(df_shp$from, Station_clean$from[i],Station_clean$clean_name[i])
df_shp$to <- str_replace(df_shp$to, Station_clean$from[i],Station_clean$clean_name[i])
}
Table 1 includes a sample of 500 rows to view the data. Use the filter boxes at the bottom of the table window to filter, select, and order the data.
This model uses data from Kaggle on Amtrak and NJ Transit, shapefile data from ESRI, and weather data from riem. For the data from kaggle we used data specifically from NJ Transit. The NJ Transit is the public transit system for the state of New Jersey and includes buses, trains, and light rail; however, for this study we are only finding delays for trains. The NJ Transit has 11 different train lines and 165 stations. The dataset that we have selected comprises of 870025 and 30 variables. From this dataset, we engineered a number of other variables such as station lag, the delay at the previous station, and the accumulated delay at each stop sequence. In addition to the train data, we also included data on the weather, at the time of the train departing. Weather data was pulled from the nearest station which was Newark International Airport.
The station lag is the magnitude by which a train is late at the station previous to the current station for a given train on a single route. Said differently, if train 123 travels from stations A -> B -> C, and is late by 2 minutes at station B, then the station lag for train 123 at station C is 2 minutes. This feature introduces the ability to account for delays at the previous station as a predictor for how late train will be to the current station. As seen later in the exploratory analyses, delays accumulate as trains get closer to their destination. The spatiotemporal process of accumulated delay is an important predictor of a train’s lateness to a station.
To conceptualize what variables to consider in the model, we explored the data through plot correlations, categorical data, delays by time, delays by station, geography and delays over time. The purpose of these exploratory analyses is to gain a better understanding for the story of lateness. We answer questions such as ‘which lines experience more lateness than others’, ‘during which times of day is lateness more prevalent in the system’, and ‘which stops in a line seem to cause more delay compared to other stops?’ By first examining these trends, we gain a better understanding of the drivers of lateness in NJ Transit’s system.
To explore the correlation in data, we used the DataExplorer package’s plot_correlation() function. This package also plots correlation of categorical data by breaking categorical data into dummy variables. The plot below shows the relationship between all the variables.
DataExplorer::plot_correlation(df %>% select(!line))
To better understand the relationships in categorical data, in the plot below the average delay was plotted for AM/PM, line, scheduled time of day, and status of the train. The plot does indicate that There is not a large difference between trains based on AM/PM; however, there are differences in delay based on the line.
Categorical <- df_shp %>%
st_drop_geometry() %>%
na.omit() %>%
select(where(is.character),delay_minutes)
#colnames(Categorical)
cat <- Categorical %>%
select(colnames(Categorical)[2], delay_minutes) %>%
mutate(Variable = colnames(Categorical[2])) %>%
dplyr::rename("Value" = 1)
for(i in 2:(nrow(as_tibble(colnames(Categorical)))-1)) {
t <- Categorical %>%
select(colnames(Categorical)[i], delay_minutes) %>%
mutate(Variable = colnames(Categorical[i])) %>%
dplyr::rename("Value" = 1)
cat <- bind_rows(cat,t)
}
ggplot(data = cat %>%
filter(!Variable %in% c("from","to","train_id","type","line")),
aes(x = Value, y = delay_minutes)) +
geom_bar(stat = "summary", fun.y = "mean") +
scale_y_continuous(labels = comma) +
facet_wrap(~Variable, scales = "free") +
plotTheme() +
labs(x = "Categorical Variables",
y = "Delay in Minutes",
title = "Cateogrical Variables Average Delay")
#ggsave("Categorical_Variables.jpg", width = 16, height = 9, unit = "in", dpi = 300)
Next we wanted to understand the relationship of lateness over the course of 24 hours. First we categorized the scheduled trains by AM Rush, Mid-Day, PM Rush, and Overnight, and then counted the frequencies of trains that were more than 5 minutes late during each of these periods. Next the overall line delays in the dataset were counted and plotted against the hour in the day. This shows at what time of day how many trains were late.
# A <-
df %>%
mutate(delay_grtr5min = case_when(delay_minutes >= 5 ~ 1,
delay_minutes < 5 ~ 0)) %>%
filter(delay_grtr5min == 1) %>%
group_by(interval60, scheduled_TOD, LINE_NAME) %>%
tally() %>%
summarize(avg_trainsLate = mean(n)) %>%
ggplot() +
geom_histogram(aes(x = avg_trainsLate), binwidth = 1) +
facet_wrap(~factor(scheduled_TOD,levels=c("AM Rush", "Mid-Day", "PM Rush", "Overnight"))) +
plotTheme() +
labs(title = "Delays by Time of Day",
x = "Number of Trains Late",
caption = "Figure X")
# B <-
df %>%
mutate(delay_grtr5min = case_when(delay_minutes >= 5 ~ 1,
delay_minutes < 5 ~ 0)) %>%
filter(delay_grtr5min == 1) %>%
group_by(interval60) %>%
ggplot() +
geom_freqpoly(aes(hour(scheduled_datetime), color = LINE_NAME), size = 1, binwidth = 1) +
facet_wrap(~LINE_NAME) +
plotTheme() +
labs(title = "Count of Lateness by Line by Time",
x = "Scheduled Hour of Day",
caption = "Figure X")
# grid.arrange(A,B, ncol = 2, nrow=1, widths = c(1.5,2),
# top = textGrob("Plots showing Usage Over Time", gp=gpar(fontsize=20,font=3)))
Next we analyzed day within the time of the year, rather than hours within a day. Figure X shows the amount of delay over the course of a year by day. The dotted line in the figure indicates the first day covid-19 was detected in the United States. Based on the data, it is difficult to determine the role that covid-19 had on average delay times. To better understand the role of covid-19, data from year prior to covid will need to be used in a t-test.
#df %>%
# filter(date == as.numeric(ymd(c("2020-01-20"))))
covid_day <- 691 # for full dataset
covid_day <- 20
animate(
df_shp %>%
st_drop_geometry() %>%
group_by(day,LINE_NAME) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
na.omit() %>%
ggplot(aes(x = day, y = mean_delay, color = LINE_NAME)) +
geom_line(se = FALSE) +
geom_vline(xintercept = covid_day, linetype = "dashed") +
labs(title = "Average Delay Over Time",
subtitle = "By Line",
y = "Average Delay",
x = "Day") +
plotTheme() +
theme(legend.position = "hide") +
facet_wrap(~LINE_NAME)+
transition_reveal(day),
width = 800, height = 800)
#anim_save(filename = "LINE_NAME.gif",animation = last_animation())
In addition to the animated plots, Figure X also includes average delays that has been smoothed using geom_smooth rather than geom_line. geom_smooth estimates a best fit line rather than directly plotting and following every single point.
ggplotly(df_shp %>%
st_drop_geometry() %>%
group_by(date,LINE_NAME) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
ggplot(aes(x = date, y = mean_delay, color = LINE_NAME)) +
geom_smooth(se = FALSE) +
geom_vline(xintercept = as.numeric(ymd(c("2020-01-20"))), linetype = "dashed") +
labs(title = "Average Delay by Day of Year",
subtitle = "By Line",
y = "Average Delay",
x = "Date",
caption = "Figure X") +
plotTheme())
#ggsave("smooth_line_delay.jpg", width = 16, height = 9, unit = "in", dpi = 300)
Finally, figure X, includes an estimated line for delays by stop sequence. Stop sequence indicates where on the line this train has stopped; for example, the North East Corridor may have multiple stops and those stops are organized by values 1, 2, 3, and so forth.This figure is particularly interesting because it shows that delays accumulate in the system. Most trains at the beginning of their route leave on time, whereas as trains experience delays further into their route.
factor_order <- c(1:26)
animate(df_shp %>%
st_drop_geometry() %>%
group_by(day,stop_sequence) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
mutate(stop_sequence = factor(stop_sequence, levels = factor_order)) %>%
ggplot(aes(x = day, y = mean_delay, color = stop_sequence)) +
geom_line(se = FALSE) +
geom_vline(xintercept = covid_day, linetype = "dashed") +
labs(title = "Average Delay by Day of Year",
subtitle = "By Line",
y = "Average Delay",
x = "Day") +
plotTheme() +
facet_wrap(~stop_sequence) +
theme(legend.position = "hide",
axis.ticks.x = element_blank()) +
transition_reveal(day),
width = 800, height = 800)
#anim_save(filename = "stop_sequence.gif",animation = last_animation())
Similarly to the analysis by Line Name, figure x below is a smoothed out using geom_smooth rather than geom_line
ggplotly(df_shp %>%
st_drop_geometry() %>%
group_by(date,stop_sequence) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
mutate(stop_sequence = factor(stop_sequence, levels = factor_order)) %>%
ggplot(aes(x = date, y = mean_delay, color = stop_sequence)) +
geom_smooth(se = FALSE) +
geom_vline(xintercept = as.numeric(ymd(c("2020-01-20"))), linetype = "dashed") +
labs(title = "Average Delay Over Time",
subtitle = "By Stop Sequence",
y = "Average Delay",
x = "Date",
caption = "Figure X") +
plotTheme())
#ggsave("smooth_Stop_delay.jpg", width = 16, height = 9, unit = "in", dpi = 300)
Rather than just analyze delays by time of day, we also wanted to understand the role that stations had in delay times for trains. Below we included a plot that shows the average delay to and from a station. This figure is particularly useful for agencies as it allows for the identification of specific stations that are causing and being a recipient of delays in the system.
ggplotly(ggplot(data = cat %>%
filter(Variable %in% c("from","to")) %>%
arrange(Value) %>%
mutate(Value = as.factor(Value)),
aes(x = Value, y = delay_minutes, fill = Variable)) +
geom_bar(position = "dodge2", stat = "summary", fun.y = "mean") +
scale_y_continuous(labels = comma) +
plotTheme() +
labs(x = "Station Names",
y = "Delay in Minutes",
title = "Cateogrical Variables Average Delay",
caption = "Figure X") +
theme(axis.title.x = element_blank()) +
coord_flip())
#ggsave("Station_Delay.jpg", width = 16*1.5, height = 9*1.5, unit = "in", dpi = 300)
In addition to analysis on Time, this study also includes space. The maps below of the NJ transit lines show the average delow going to and from each station on the NJ Transit. As illustrated below, the average train on the Atlantic City Line is around 15 minutes late.
A <- df_shp %>%
st_drop_geometry() %>%
group_by(from) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
mutate(Bin_mean_delay = cut_interval(mean_delay, n = 20, dig.lab = 10))
A <- length(unique(A$Bin_mean_delay))
cc <- scales::seq_gradient_pal("Red","Grey")(seq(0,1,length.out=A))
#df_shp %>%
# st_drop_geometry() %>%
# group_by(from) %>%
# summarize(mean_delay = mean(delay_minutes)) %>%
# mutate(Bin_mean_delay = cut_interval(mean_delay, n = 20, dig.lab = 10),
# Bin_mean_delay = str_remove(Bin_mean_delay,"."),
# Bin_mean_delay = str_sub(Bin_mean_delay,1,nchar(Bin_mean_delay)-1)) %>%
# separate(Bin_mean_delay, into = c("Start","End"), sep = ",") %>%
# mutate(Start = round(as.numeric(Start),3),
# End = round(as.numeric(End),3)) %>%
# arrange(Start) %>%
# mutate(Start = as.character(Start),
# End = as.character(End),
# Interval = paste0(Start," - ",End),
# Interval = forcats::as_factor(Interval)) %>%
# rename(STATION_ID = from) %>%
# left_join(NJtransitstations, by = "STATION_ID") %>%
# st_as_sf() %>%
# ggplot() +
# geom_sf(data = NJTransitLines) +
# geom_sf(aes(color = Interval)) +
# geom_sf(data = NJshp, fill = "transparent") +
# scale_color_manual(values = cc) +
# labs(title = "Delay in minutes going from a Station") +
# mapTheme()
B <- df_shp %>%
st_drop_geometry() %>%
group_by(from) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
rename(STATION_ID = from) %>%
left_join(NJtransitstations, by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = mean_delay)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "From a Station") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
#ggsave("Delay_from_station.jpg", width = 16, height = 9, unit = "in", dpi = 300)
A <- df_shp %>%
st_drop_geometry() %>%
group_by(to) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
mutate(Bin_mean_delay = cut_interval(mean_delay, n = 20, dig.lab = 10))
A <- length(unique(A$Bin_mean_delay))
cc <- scales::seq_gradient_pal("Red","Grey")(seq(0,1,length.out=A))
#df_shp %>%
# st_drop_geometry() %>%
# group_by(to) %>%
# summarize(mean_delay = mean(delay_minutes)) %>%
# mutate(Bin_mean_delay = cut_interval(mean_delay, n = 20, dig.lab = 10),
# Bin_mean_delay = str_remove(Bin_mean_delay,"."),
# Bin_mean_delay = str_sub(Bin_mean_delay,1,nchar(Bin_mean_delay)-1)) %>%
# separate(Bin_mean_delay, into = c("Start","End"), sep = ",") %>%
# mutate(Start = as.numeric(Start)) %>%
# arrange(Start) %>%
# mutate(Start = as.character(Start),
# Interval = paste0(Start," - ",End),
# Interval = forcats::as_factor(Interval)) %>%
# rename(STATION_ID = to) %>%
# left_join(NJtransitstations, by = "STATION_ID") %>%
# st_as_sf() %>%
# ggplot() +
# geom_sf(data = NJTransitLines) +
# geom_sf(aes(color = Interval)) +
# geom_sf(data = NJshp, fill = "transparent") +
# scale_color_manual(values = cc) +
# labs(title = "Delay in minutes going to a Station") +
# mapTheme()
#ggsave("Delay_to_station.jpg", width = 16*.5, height = 9*.5, unit = "in", dpi = 300)
C <- df_shp %>%
st_drop_geometry() %>%
group_by(to) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
rename(STATION_ID = to) %>%
left_join(NJtransitstations, by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = mean_delay)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "To a Station") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
#ggsave("Delay_to_station.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
grid.arrange(C,B, ncol = 2, nrow=1,
top = textGrob("Average Delay To and From Station", gp=gpar(fontsize=20,font=3)))
After exploring the data, we built a linear model that includes stop_sequence, from, to, delay_minutes, week, DotW, Start_time, Year, station_lag, scheduled_TOD, day, Temperature, Precipitation, Wind_Speed, LINE_NAME. Compared to a model that includes all the variables, this model is more advantageous because it avoids the problem of being overfit. When a model was built that included all the variables, the model had an R^2 value of 1 and 0 Mean Squared Error (MSE). The current model includes an MSE not equal to 0 and a R^2 value less than 1, which means the model is not overfit, and can predict on new data.
Once we identified the suitable predictors for the model, the data were then subsetted into training and test datasets. The training set are the data on which the model is built and consists of 14 weeks of the original dataset. The model is then evaluated to determine its effectiveness using the test set consisting of the 4 weeks that follow the training set.
The table below contains model summary and coefficients and corresponding p-values for our model.
#reg.training <- fastLm(delay_minutes ~ ., data = Train)
reg.training <- lm(delay_minutes ~ ., data = Train)
#summary(reg.training) Our R squared is 1. Definitely an overfit model. We'll see how well it predicts though.
f <- predict(reg.training, Test)
Test <- bind_cols(Test, as_tibble(f)) %>%
rename(Delay.Predict = value) %>%
mutate(Delay.Error = Delay.Predict - delay_minutes,
Delay.AbsError = abs(Delay.Predict - delay_minutes),
Delay.APE = (abs(Delay.Predict - delay_minutes)) / Delay.Predict)
mae <- mean(Test$Delay.AbsError, na.rm=TRUE)
#print(mae)
mape <- mean(Test$Delay.APE, na.rm=TRUE)
#print(mape)
# Table of regression outputs
#modelsummary(reg.training, output = "kableExtra", stars = TRUE, gof_omit = 'IC|Log|AIC|BIC')
stargazer(reg.training, type="html", header = TRUE, align=TRUE,
single.row = TRUE, style = "apsr",
notes.append = FALSE,
notes = c("<sup>⋆</sup>p<0.1; <sup>⋆⋆</sup>p<0.05; <sup>⋆⋆⋆</sup>p<0.01"))
| delay_minutes | |
| stop_sequence | -0.023*** (0.001) |
| fromAbsecon | 3.077*** (0.214) |
| fromAllendale | 1.224*** (0.128) |
| fromAllenhurst | 0.245 (0.177) |
| fromAnderson Street | 2.917*** (0.147) |
| fromAnnandale | 1.813*** (0.267) |
| fromAsbury Park | -0.208 (0.146) |
| fromAtco | 3.199*** (0.897) |
| fromAtlantic City Rail Terminal | 4.444*** (0.905) |
| fromAvenel | 1.014*** (0.098) |
| fromBasking Ridge | 0.596*** (0.190) |
| fromBay Head | -0.309 (0.211) |
| fromBay Street | 1.511*** (0.117) |
| fromBelmar | 0.108 (0.170) |
| fromBerkeley Heights | 2.249*** (0.146) |
| fromBernardsville | -2.223*** (0.200) |
| fromBloomfield | 0.655*** (0.110) |
| fromBoonton | 2.519*** (0.177) |
| fromBound Brook | -0.507*** (0.150) |
| fromBradley Beach | 0.092 (0.197) |
| fromBrick Church | 0.373*** (0.100) |
| fromBridgewater | -0.749*** (0.148) |
| fromBroadway Fair Lawn | 0.542*** (0.133) |
| fromCampbell Hall | 1.086*** (0.191) |
| fromChatham | 0.347*** (0.113) |
| fromCherry Hill | 3.102*** (0.899) |
| fromClifton | 2.108*** (0.136) |
| fromConvent Station | 0.231* (0.120) |
| fromCranford | -0.912*** (0.113) |
| fromDelawanna | 1.605*** (0.130) |
| fromDenville | 1.168*** (0.128) |
| fromDover | 2.008*** (0.135) |
| fromDunellen | -0.785*** (0.136) |
| fromEast Orange | -0.210** (0.100) |
| fromEdison | 1.169*** (0.097) |
| fromEgg Harbor City | 4.769*** (0.900) |
| fromElberon | 0.370*** (0.118) |
| fromElizabeth | 0.754*** (0.089) |
| fromEmerson | 2.603*** (0.162) |
| fromEssex Street | 1.518*** (0.135) |
| fromFanwood | -0.797*** (0.125) |
| fromFar Hills | 2.807*** (0.207) |
| fromGarfield | 1.274*** (0.130) |
| fromGarwood | -0.485*** (0.123) |
| fromGillette | 0.170 (0.154) |
| fromGladstone | 2.906*** (0.225) |
| fromGlen Ridge | 1.119*** (0.115) |
| fromGlen Rock Boro Hall | -0.071 (0.132) |
| fromGlen Rock Main Line | -0.174 (0.137) |
| fromHackettstown | 3.954*** (0.257) |
| fromHamilton | 3.270*** (0.107) |
| fromHammonton | 1.023*** (0.193) |
| fromHarriman | 2.606*** (0.166) |
| fromHawthorne | 1.352*** (0.134) |
| fromHazlet | 1.093*** (0.124) |
| fromHigh Bridge | 1.221*** (0.292) |
| fromHighland Avenue | 0.621*** (0.106) |
| fromHillsdale | 2.083*** (0.159) |
| fromHoboken | 0.865*** (0.102) |
| fromJersey Avenue | 1.934*** (0.106) |
| fromKingsland | 0.761*** (0.116) |
| fromLake Hopatcong | 0.918*** (0.179) |
| fromLebanon | 0.690*** (0.249) |
| fromLincoln Park | 2.769*** (0.192) |
| fromLinden | 1.801*** (0.091) |
| fromLindenwold | 2.096*** (0.170) |
| fromLittle Falls | 0.909*** (0.172) |
| fromLittle Silver | -0.184** (0.090) |
| fromLong Branch | 2.954*** (0.154) |
| fromLyndhurst | 0.768*** (0.139) |
| fromLyons | -0.599*** (0.184) |
| fromMadison | 0.502*** (0.117) |
| fromMahwah | 0.619*** (0.131) |
| fromManasquan | -1.364*** (0.192) |
| fromMaplewood | 0.052 (0.106) |
| fromMetropark | 1.353*** (0.092) |
| fromMetuchen | 0.986*** (0.095) |
| fromMiddletown NJ | -0.226*** (0.063) |
| fromMiddletown NY | 1.709*** (0.205) |
| fromMillburn | 0.117 (0.104) |
| fromMillington | 1.903*** (0.173) |
| fromMontclair Heights | -0.017 (0.138) |
| fromMontclair State U | 1.639*** (0.139) |
| fromMontvale | 2.121*** (0.166) |
| fromMorris Plains | 0.200 (0.124) |
| fromMorristown | 0.958*** (0.122) |
| fromMount Arlington | 1.446*** (0.157) |
| fromMount Olive | 1.602*** (0.229) |
| fromMount Tabor | 0.652*** (0.129) |
| fromMountain Avenue | 0.407*** (0.135) |
| fromMountain Lakes | 0.577*** (0.154) |
| fromMountain Station | 0.349*** (0.107) |
| fromMountain View | 0.302 (0.193) |
| fromMurray Hill | 0.003 (0.133) |
| fromNanuet | 2.670*** (0.180) |
| fromNetcong | 2.476*** (0.206) |
| fromNetherwood | -1.329*** (0.129) |
| fromNew Bridge Landing | 1.454*** (0.145) |
| fromNew Brunswick | 1.286*** (0.098) |
| fromNew Providence | 1.817*** (0.123) |
| fromNew York Penn Station | 1.474*** (0.093) |
| fromNewark Airport | 0.896*** (0.089) |
| fromNewark Broad Street | 0.418*** (0.096) |
| fromNewark Penn Station | 1.509*** (0.088) |
| fromNorth Branch | 3.107*** (0.200) |
| fromNorth Elizabeth | 1.291*** (0.093) |
| fromOradell | 1.899*** (0.156) |
| fromOrange | 0.151 (0.102) |
| fromOtisville | -0.233 (0.214) |
| fromPark Ridge | 2.693*** (0.158) |
| fromPassaic | -0.244* (0.145) |
| fromPaterson | 0.044 (0.144) |
| fromPeapack | -2.785*** (0.217) |
| fromPearl River | 1.704*** (0.161) |
| fromPennsauken | 1.835*** (0.142) |
| fromPerth Amboy | 0.747*** (0.061) |
| fromPhiladelphia | 3.132*** (0.904) |
| fromPlainfield | -0.262* (0.139) |
| fromPlauderville | 1.207*** (0.130) |
| fromPoint Pleasant Beach | 2.819*** (0.233) |
| fromPort Jervis | 3.405*** (0.228) |
| fromPrinceton Junction | 0.763*** (0.101) |
| fromRadburn Fair Lawn | 1.926*** (0.128) |
| fromRahway | 1.828*** (0.088) |
| fromRamsey Main St | 1.149*** (0.131) |
| fromRamsey Route 17 | 0.374*** (0.125) |
| fromRaritan | -0.775*** (0.163) |
| fromRed Bank | 1.956*** (0.140) |
| fromRidgewood | 1.467*** (0.125) |
| fromRiver Edge | 2.857*** (0.157) |
| fromRoselle Park | -0.305*** (0.106) |
| fromRutherford | 1.901*** (0.116) |
| fromSalisbury Mills-Cornwall | 0.054 (0.181) |
| fromSecaucus Lower Lvl | 2.194*** (0.125) |
| fromSecaucus Upper Lvl | 0.166* (0.091) |
| fromShort Hills | 0.387*** (0.109) |
| fromSloatsburg | 1.725*** (0.145) |
| fromSomerville | -0.158 (0.158) |
| fromSouth Amboy | 1.445*** (0.107) |
| fromSouth Orange | 0.403*** (0.103) |
| fromSpring Lake | 1.923*** (0.215) |
| fromSpring Valley | 2.454*** (0.182) |
| fromStirling | -0.693*** (0.166) |
| fromSuffern | 1.150*** (0.132) |
| fromSummit | 1.047*** (0.107) |
| fromTeterboro | 2.298*** (0.140) |
| fromTowaco | 0.030 (0.188) |
| fromTrenton | -0.869*** (0.112) |
| fromTuxedo | 0.281* (0.159) |
| fromUnion | 0.033 (0.108) |
| fromUpper Montclair | 0.520*** (0.129) |
| fromWaldwick | 0.036 (0.131) |
| fromWalnut Street | 0.740*** (0.123) |
| fromWatchung Avenue | 0.919*** (0.126) |
| fromWatsessing Avenue | 1.538*** (0.105) |
| fromWayne-Route 23 | 2.048*** (0.182) |
| fromWesmont | 0.996*** (0.132) |
| fromWestfield | -1.042*** (0.117) |
| fromWestwood | 2.401*** (0.158) |
| fromWhite House | -1.472*** (0.220) |
| fromWood Ridge | 1.961*** (0.123) |
| fromWoodbridge | 0.476*** (0.091) |
| fromWoodcliff Lake | 2.269*** (0.163) |
| toAbsecon | -2.532*** (0.894) |
| toAllendale | 1.020*** (0.131) |
| toAllenhurst | 1.351*** (0.176) |
| toAnderson Street | 0.672*** (0.144) |
| toAnnandale | 0.803*** (0.269) |
| toAsbury Park | 0.669*** (0.146) |
| toAtco | -1.073*** (0.131) |
| toAtlantic City Rail Terminal | -3.178*** (0.184) |
| toAvenel | -0.381*** (0.108) |
| toBasking Ridge | 3.771*** (0.194) |
| toBay Head | -3.405*** (0.211) |
| toBay Street | 0.113 (0.121) |
| toBelmar | 1.292*** (0.170) |
| toBerkeley Heights | 0.496*** (0.149) |
| toBernardsville | -1.539*** (0.202) |
| toBloomfield | -0.616*** (0.114) |
| toBoonton | 0.646*** (0.180) |
| toBound Brook | 1.809*** (0.146) |
| toBradley Beach | 1.095*** (0.196) |
| toBrick Church | 0.829*** (0.105) |
| toBridgewater | 1.749*** (0.157) |
| toBroadway Fair Lawn | -0.461*** (0.135) |
| toCampbell Hall | 0.477** (0.193) |
| toChatham | 0.400*** (0.117) |
| toCherry Hill | -0.282*** (0.093) |
| toClifton | 2.405*** (0.139) |
| toConvent Station | 0.199 (0.124) |
| toCranford | 1.500*** (0.119) |
| toDelawanna | 1.823*** (0.134) |
| toDenville | 0.264** (0.132) |
| toDover | -0.184 (0.138) |
| toDunellen | 1.351*** (0.148) |
| toEast Orange | 0.514*** (0.105) |
| toEdison | -0.151 (0.104) |
| toEgg Harbor City | 0.183 (0.160) |
| toElberon | 1.503*** (0.118) |
| toElizabeth | -0.179* (0.095) |
| toEmerson | 0.365** (0.162) |
| toEssex Street | 0.422*** (0.142) |
| toFanwood | 1.809*** (0.130) |
| toFar Hills | 3.949*** (0.210) |
| toGarfield | 0.974*** (0.134) |
| toGarwood | 1.368*** (0.125) |
| toGillette | 1.231*** (0.159) |
| toGladstone | 2.616*** (0.227) |
| toGlen Ridge | 0.115 (0.119) |
| toGlen Rock Boro Hall | 0.684*** (0.135) |
| toGlen Rock Main Line | 0.805*** (0.139) |
| toHackettstown | -2.431*** (0.260) |
| toHamilton | 0.653*** (0.110) |
| toHammonton | -2.700*** (0.889) |
| toHarriman | 2.335*** (0.169) |
| toHawthorne | 1.745*** (0.137) |
| toHazlet | 1.073*** (0.124) |
| toHigh Bridge | -3.243*** (0.286) |
| toHighland Avenue | 0.947*** (0.111) |
| toHillsdale | -0.110 (0.161) |
| toHoboken | -1.953*** (0.106) |
| toJersey Avenue | -1.313*** (0.114) |
| toKingsland | 1.092*** (0.121) |
| toLake Hopatcong | -1.600*** (0.182) |
| toLebanon | 0.164 (0.243) |
| toLincoln Park | 0.692*** (0.196) |
| toLinden | 0.714*** (0.097) |
| toLindenwold | -0.843 (0.887) |
| toLittle Falls | -0.585*** (0.164) |
| toLittle Silver | -0.174* (0.090) |
| toLong Branch | -1.356*** (0.154) |
| toLyndhurst | 0.709*** (0.141) |
| toLyons | -0.954*** (0.186) |
| toMadison | 0.680*** (0.122) |
| toMahwah | 1.933*** (0.133) |
| toManasquan | -1.653*** (0.191) |
| toMaplewood | 0.673*** (0.109) |
| toMetropark | 0.511*** (0.098) |
| toMetuchen | -0.113 (0.099) |
| toMiddletown NJ | -0.435*** (0.063) |
| toMiddletown NY | 1.819*** (0.204) |
| toMillburn | 0.639*** (0.112) |
| toMillington | 1.678*** (0.177) |
| toMontclair Heights | 0.704*** (0.139) |
| toMontclair State U | -0.108 (0.146) |
| toMontvale | -0.489*** (0.165) |
| toMorris Plains | -0.003 (0.127) |
| toMorristown | 0.950*** (0.126) |
| toMount Arlington | 0.281* (0.163) |
| toMount Olive | -1.638*** (0.231) |
| toMount Tabor | -0.178 (0.133) |
| toMountain Avenue | 0.773*** (0.139) |
| toMountain Lakes | -1.117*** (0.168) |
| toMountain Station | 0.560*** (0.111) |
| toMountain View | -0.584*** (0.192) |
| toMurray Hill | -1.010*** (0.138) |
| toNanuet | 1.115*** (0.177) |
| toNetcong | 0.207 (0.207) |
| toNetherwood | 1.509*** (0.136) |
| toNew Bridge Landing | -0.620*** (0.156) |
| toNew Brunswick | 0.373*** (0.101) |
| toNew Providence | -0.195 (0.126) |
| toNew York Penn Station | -0.912*** (0.097) |
| toNewark Airport | 0.561*** (0.094) |
| toNewark Broad Street | 1.477*** (0.101) |
| toNewark Penn Station | 0.240** (0.094) |
| toNorth Branch | 1.579*** (0.192) |
| toNorth Elizabeth | 0.401*** (0.098) |
| toOradell | -1.137*** (0.163) |
| toOrange | 0.315*** (0.107) |
| toOtisville | 0.040 (0.216) |
| toPark Ridge | 0.642*** (0.163) |
| toPassaic | 0.280* (0.147) |
| toPaterson | 0.561*** (0.146) |
| toPeapack | -1.041*** (0.218) |
| toPearl River | 0.504*** (0.170) |
| toPennsauken | -1.374 (0.892) |
| toPerth Amboy | 0.404*** (0.061) |
| toPhiladelphia | |
| toPlainfield | 1.850*** (0.137) |
| toPlauderville | 1.215*** (0.134) |
| toPoint Pleasant Beach | 2.145*** (0.232) |
| toPort Jervis | 3.954*** (0.227) |
| toPrinceton Junction | -0.757*** (0.106) |
| toRadburn Fair Lawn | 1.719*** (0.131) |
| toRahway | -0.806*** (0.091) |
| toRamsey Main St | 1.308*** (0.132) |
| toRamsey Route 17 | 1.538*** (0.127) |
| toRaritan | -0.494*** (0.167) |
| toRed Bank | 2.085*** (0.139) |
| toRidgewood | 2.046*** (0.127) |
| toRiver Edge | 0.592*** (0.156) |
| toRoselle Park | 2.147*** (0.110) |
| toRutherford | 1.160*** (0.120) |
| toSalisbury Mills-Cornwall | 1.588*** (0.180) |
| toSecaucus Lower Lvl | 1.262*** (0.128) |
| toSecaucus Upper Lvl | 1.326*** (0.096) |
| toShort Hills | 0.452*** (0.111) |
| toSloatsburg | 0.724*** (0.148) |
| toSomerville | 2.022*** (0.159) |
| toSouth Amboy | 1.010*** (0.107) |
| toSouth Orange | 0.783*** (0.109) |
| toSpring Lake | 1.788*** (0.215) |
| toSpring Valley | -1.454*** (0.190) |
| toStirling | -0.973*** (0.168) |
| toSuffern | 0.445*** (0.130) |
| toSummit | -0.123 (0.113) |
| toTeterboro | 0.297** (0.142) |
| toTowaco | -1.571*** (0.193) |
| toTrenton | -3.741*** (0.117) |
| toTuxedo | 0.376** (0.157) |
| toUnion | 1.204*** (0.112) |
| toUpper Montclair | 0.610*** (0.133) |
| toWaldwick | 0.572*** (0.133) |
| toWalnut Street | -0.181 (0.126) |
| toWatchung Avenue | 0.407*** (0.130) |
| toWatsessing Avenue | 0.326*** (0.110) |
| toWayne-Route 23 | 0.594*** (0.186) |
| toWesmont | 0.415*** (0.134) |
| toWestfield | 1.872*** (0.121) |
| toWestwood | 0.350** (0.166) |
| toWhite House | -0.640*** (0.224) |
| toWood Ridge | 1.024*** (0.126) |
| toWoodbridge | 0.189* (0.101) |
| toWoodcliff Lake | 0.409** (0.167) |
| week | -0.041*** (0.011) |
| DotW | 0.003 (0.002) |
| Start_time | 0.00000*** (0.00000) |
| Year | |
| station_lag | 0.970*** (0.0004) |
| scheduled_TODMid-Day | -0.065*** (0.009) |
| scheduled_TODOvernight | -0.064*** (0.009) |
| scheduled_TODPM Rush | -0.095*** (0.010) |
| day | 0.004*** (0.002) |
| Temperature | 0.002*** (0.0003) |
| Precipitation | -0.010 (0.015) |
| Wind_Speed | 0.002*** (0.0005) |
| LINE_NAMEBergen County Line | -1.029*** (0.115) |
| LINE_NAMEGladstone Branch | 0.174*** (0.060) |
| LINE_NAMEMain Line | -0.979*** (0.115) |
| LINE_NAMEMontclair-Boonton Line | 0.011 (0.060) |
| LINE_NAMEMorristown Line | 0.086 (0.058) |
| LINE_NAMENJ Coast Line | -0.283*** (0.051) |
| LINE_NAMENortheast Corridor | -0.366*** (0.051) |
| LINE_NAMEPascack Valley Line | -1.368*** (0.120) |
| LINE_NAMERaritan Valley Line | |
| Constant | -0.712*** (0.128) |
| N | 642,197 |
| R2 | 0.929 |
| Adjusted R2 | 0.929 |
| Residual Std. Error | 2.163 (df = 641853) |
| F Statistic | 24,579.300*** (df = 343; 641853) |
| ⋆p<0.1; ⋆⋆p<0.05; ⋆⋆⋆p<0.01 | |
Based on the linear regression we found the Mean Absolute Error (MAE) to be and the Mean Absolute Percent Error (MAPE) to be . These ‘goodness of fit’ metrics are used to measure how closely the model can predict the delays of trains in the test dataset. The MAE can be interpreted as the average number of minutes that the model over or under predicts compared to the actual delay at each stop. If our minimum delay time is 0, and maximum delay time is , then an MAE of is actually pretty good. Further, the MAPE is interpreted as the percentage of trains that the model predicted incorrectly.
To ensure build a robust model, we utilized the K-fold cross-validation technique to repeat the modeling process detailed above. In essence, cross-validation process involves partitioning the dataset into K number groups (otherwise known as folds). For each fold, the modeling process described above is repeated - split into training/test sets, build model on training, calculate goodness of fit metrics for test set. The repetition of the modeling process allows us to simulate how our model would perform on unseen data in a real world setting.
fitControl <- trainControl(method = "cv", number = 5)
set.seed(825)
reg.cv <-
train(delay_minutes ~ ., data = Train,
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
## Linear Regression
##
## 698456 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 558766, 558765, 558764, 558765, 558764
## Resampling results:
##
## RMSE Rsquared MAE
## 2.152509 0.9295134 1.090011
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The results of our cross-validation indicate that the model is an effective method for predicting NJ Transit train lateness. The RMSE value of 2.1525094 indicates that on average, the model either over-predicts or under-predicts lateness by about 2 minutes. Additionally, the R^2 value of 0.9295134 indicates that the model may have some over fit. However, these values only give impressions about the model as a whole. To identify stations for which the model is better or worse at predicting, we must visualize the errors geographically.
First we wanted to visualize the average predicted delay times by geography. Figure X includes the predicted delay by line. Similar to Figure X, the predicted delays are very similarly spread out geographically.
Test %>%
na.omit() %>%
group_by(LINE_NAME) %>%
summarize(Mean_Delay = mean(Delay.Predict)) %>%
left_join(NJTransitLines, by = "LINE_NAME") %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(color = Mean_Delay,
linetype = LINE_NAME)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "Estimated Delay by Line",
caption = "Figure X") +
scale_color_gradient(low = "green",
high = "red") +
mapTheme()
ggsave("Modeled_Delay_Line.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
Additionally, we’ve included the predicted delay TO and From each station by geographic location.
B <- Test %>%
group_by(from) %>%
summarize(Mean_Delay = mean(Delay.Predict)) %>%
rename(STATION_ID = from) %>%
left_join(NJtransitstations %>%
select(STATION_ID,LATITUDE,LONGITUDE,geometry), by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = Mean_Delay)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "From",
x = "Predicted Mean Delay") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
ggsave("Modeled_Delay_from_station.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
C <- Test %>%
group_by(to) %>%
summarize(Mean_Delay = mean(Delay.Predict)) %>%
rename(STATION_ID = to) %>%
left_join(NJtransitstations %>%
select(STATION_ID,LATITUDE,LONGITUDE,geometry), by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = Mean_Delay)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "To",
x = "Predicted Mean Delay") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
ggsave("Modeled_Delay_to_station.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
grid.arrange(C,B, ncol = 2, nrow=1,
top = textGrob("Average Predicted Delay To and From Station", gp=gpar(fontsize=20,font=3)))
Although the model accounts for most error, we wanted to check how the well the model accounts for differences in geography. It is clear that while the model is good a predicting values close to the average delay, it loses predictive power for the Atlantic City Line with is consistently more late than the other NJ Transit lines.
Test %>%
na.omit() %>%
group_by(LINE_NAME) %>%
summarize(Mean_AbsError = mean(Delay.AbsError)) %>%
left_join(NJTransitLines, by = "LINE_NAME") %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(color = Mean_AbsError,
linetype = LINE_NAME)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "Estimated Delay by Line") +
scale_color_gradient(low = "green",
high = "red") +
mapTheme()
ggsave("Modeled_Delay_Line.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
B <- Test %>%
group_by(from) %>%
summarize(Mean_AbsError = mean(Delay.AbsError)) %>%
rename(STATION_ID = from) %>%
left_join(NJtransitstations %>%
select(STATION_ID,LATITUDE,LONGITUDE,geometry), by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = Mean_AbsError)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "From",
x = "Predicted Mean Delay") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
ggsave("Modeled_Delay_from_station.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
C <- Test %>%
group_by(to) %>%
summarize(Mean_AbsError = mean(Delay.AbsError)) %>%
rename(STATION_ID = to) %>%
left_join(NJtransitstations %>%
select(STATION_ID,LATITUDE,LONGITUDE,geometry), by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = Mean_AbsError)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "To",
x = "Predicted Mean Delay") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
ggsave("Modeled_Delay_to_station.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
grid.arrange(C,B, ncol = 2, nrow=1,
top = textGrob("Average of predicted Delay To and From Station", gp=gpar(fontsize=20,font=3)))
It is apparent that this is a very accurate model. Overall the model predicts how late any given NJ transit train will be to a station with a 92.95% accuracy. Further, it predicts well for the commuter lines such as The Main Line, The Raritan Valley Line, and the Morristown Line. However, the fact that the model predicts well for all lines except the Atlantic City Line may indicate an issue with overfitting and the inability to generalize to different spatial contexts. With such a robust set of data, it is unlikely that we would see issues with predicting across the temporal axis though.
Therefore, we recommend that NJ Transit make use of this tool for predicting when and where their trains will be late. In turn, this tool will then give operators the ability to work towards mitigation of the drivers of lateness in their system. This is a possible avenue for further expansion of this tool’s capabilities. If we were given access to NJ Transit’s mechanical issues data, provided that it exists, we could identify when and where mechanical issues cause lateness.
It is clear that this is a powerful tool, not only for the transit operators, but also for the average NJ Transit rider. The ability to predict when a train will arrive with a margin of error of minutes could be highly attractive to the time-crunched transit rider.
Regarding improvements to the model, in order to transition the model to an agency outside of NJ Transit, we could remove the line name and stop names predictors and instead use the distance between each station. The idea being that if stations are relatively close to each other, delays accumulate as trains cannot speed up to regain time. These changes could also be useful for the NJ Transit deployment as well.
Freishtat, S. (2021). Waiting longer for Follow your bus or train? The CTA has been running fewer of them during the pandemic. 5. Lazo, L. (n.d.). Amtrak’s chronic delays are costing millions of dollars, report says. 2. Penney, V. (2021). How Coronavirus Has Changed New York City Transit, in One Chart. 7.